Joe Benton

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

May 22, 2026

Efficiently Aligning Language Models with Online Natural Language Feedback

May 05, 2026

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025

Reasoning Models Don't Always Say What They Think

May 08, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Sabotage Evaluations for Frontier Models

Oct 28, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Jul 21, 2024

Measuring Feature Sparsity in Language Models

Oct 13, 2023

Linear Convergence Bounds for Diffusion Models via Stochastic Localization

Aug 07, 2023

Error Bounds for Flow Matching Methods

May 26, 2023